The objective of this module is to set you up with a new local repository under version control in which you will take notes and practice coding, and also set you up to be able to push changes that you make locally to a remote copy of the repository hosted on GitHub. We also introduce the idea of reproducible research and practice implementing it using RMarkdown.
Last time, we introduced the concept of version control and looked at tools in RStudio for interfacing with git, a popular version control system. Today, we are going to put these ideas into practice as a means for fostering reproducible research.
Reproducible research refers to conducting and disseminating scientific research in a way that makes data analysis (and scientific claims more generally) more transparent and replicable. We already have means of sharing methods and results generally, through publications, although perhaps typically in less than complete detail (!), and we can share the data on which those are based by including them in some form of online repository (e.g., via “supplementary information” that accompanies an article or by posting datasets to repositories like the Dryad Digital Repository or Figshare). But how do we share the details of exactly how we did an analysis? And how can we ensure that it is possible for us to go back, ourselves, and replicate a particular analysis or data transformation? One solution is to integrate detailed text describing a workflow and analytical source code (such as R scripts) together in the same document.
This idea of tying together narrative, logic, specific code, and data (or references to them) in a single document stems from the principle of literate programming developed by Donald Knuth. Applied to scientific practice, the concept of literate programming means documenting both the logic behind and code used for an analysis conducted with a computer program. This documentation allows researchers to return to and reexamine their own thought processes at any later time, and also allows them to share their thought processes so others can understand how an analysis was performed. The upshot is that our scholarship can be better understood, recreated, and independently verified.
This is exactly the point of an RMarkdown document.
So, how does this work?
First, Markdown is a way of styling a plain text document so that it can be easily rendered into HTML or PDF files. It uses some simple formatting and special characters to tag pieces of text so that a parser knows how to convert the plain text into HTML or PDF. John Gruber, Markdown’s creator, has written a description of Markdown, its syntax, and the philosophy behind it, and GitHub provides a guide to the Markdown syntax it uses.
RMarkdown is an extension of Markdown that allows you to embed chunks of R code and additional parsing instructions in a plain text file. Then, during the rendering or “knitting” stage, when the text file is being converted to HTML or PDF format, the output of running the embedded code can also be included.
RMarkdown uses the package {knitr} to produce files that can compile into a variety of formats, including HTML, traditional Markdown, PDFs, MS Word documents, web presentations, and others. A cheatsheet of RMarkdown syntax is also available.
To put these ideas into practice, today we are going to create a new R project and corresponding repository that we will track with version control using git. In that project, we are going to create an RMarkdown document in which you will take notes and practice coding during class today and for homework this week.
The information below is based on two helpful online posts.
Version control allows backup of scripts and easy collaboration on complex projects. RStudio includes some valuable tools for managing projects and for implementing version control on those projects. It works well with git, an open source distributed version control system, and GitHub, a web-based git repository hosting service. It also works well with Subversion (or SVN), another popular version control system.
If you have not already done so, download and install git for your operating system and create an account on GitHub.com.
Remember, git is a piece of software running on your own computer, and that is distinct from GitHub, which is the remote repository website, and from GitHub Desktop, which is a GUI to git.
We next need to tell git (on your local computer) who you are so that when you make commits either locally or remotely, they are associated with a particular user name and email. To do this within RStudio, select Tools -> Shell…, which will open up a Terminal window. Alternatively, you can enter these commands directly in a terminal window you open yourself, rather than one opened from within RStudio.
git config --global user.email your email address
git config --global user.name your GitHub username
GitHub will link any commits to the username associated with the email address used to tag your commits, even if you enter a different username here. If you use an email address that is not already associated with a GitHub account, then the entered username will appear associated with your commits.
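A minimal sketch of setting these values and then reading them back to confirm they were stored. The name and email are placeholders. Pointing `HOME` at a scratch directory is an assumption made only for this trial run, so that experimenting does not overwrite real settings; leave those two lines out to configure your actual account.

```shell
# Trial run only: use a throwaway HOME so your real ~/.gitconfig is untouched.
# (Assumption for this sketch; remove these two lines to change real settings.)
export HOME="$(mktemp -d)"
unset XDG_CONFIG_HOME

# Record the identity that git attaches to every commit you make.
git config --global user.email "jane.doe@example.com"   # placeholder email
git config --global user.name "janedoe"                 # placeholder username

# Read the values back to confirm they were stored.
configured_email="$(git config --global --get user.email)"
configured_name="$(git config --global --get user.name)"
echo "Commits will be recorded as: $configured_name <$configured_email>"
```

Running `git config --global --list` afterwards prints every global setting, which is a quick way to spot a typo in either value.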
We now need to set up RStudio to interface with git.
- In RStudio’s Global Options, select Git/SVN.
- Check Enable version control interface for RStudio projects.
- Create an RSA key so you can use SSH, a secure information transfer protocol, to send and receive data from remote servers.
Now, there are a couple of different routes we can take to get RStudio, git, and GitHub to work together.
RStudio’s version control features are tied to the use of Projects (which are a way of dividing work into multiple contexts, each with their own working directory). The steps required to use version control with a project vary depending on whether the project is new or existing as well as whether it is already under version control.
- On GitHub, create a new repository and initialize it with a README file.
- In RStudio, create a new project and choose Version Control.
- The Project directory name should be your repo name, and you should choose a parent directory for your local copy of the repo using the Create project as a subdirectory of: text field. Then click Create Project.
The remote repository will then be cloned into the specified directory, and RStudio’s version control features will be available for that directory. RStudio will create an .Rproj file in the directory (with the name of your project) and a .gitignore file inside of it. This directory is now being tracked by git.
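What RStudio does at this step corresponds roughly to `git clone` at the command line. In the sketch below, a local bare repository stands in for the GitHub repo so that the commands run without a network connection or an account. The directory names are hypothetical, and with a real repo you would pass its GitHub URL to `git clone` instead.

```shell
# Scratch area; nothing outside it is touched.
scratch="$(mktemp -d)"

# Stand-in for the remote repo on GitHub (assumption: a local bare repo).
git init --quiet --bare "$scratch/remote.git"

# Clone it into a new project directory, as RStudio does for you.
# git warns that the cloned repository is empty; that is expected here.
git clone --quiet "$scratch/remote.git" "$scratch/my-project"
cd "$scratch/my-project"

# The clone remembers where it came from under the remote name "origin".
origin_url="$(git remote get-url origin)"
echo "origin -> $origin_url"
```

Because the clone already knows its origin, pushing and pulling work without any further setup.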
- On the Git tab, check the Staged box next to each file you want to add to the repository. Each staged file is then marked with an A (for “added”).
- Click Commit. You will see another window open up with the names of the files in your directory in the upper left pane. Selecting any of these will show you the contents of the file in the lower pane.
- Enter a commit message and click Commit. A window will pop up confirming what you have just done, which you can close.
You have now committed the current version of the file(s) you have added to your repository on your local computer. These files should no longer show up on the Git tab.
- Now edit one or more of your files and save those edits. The changed file should show up in the Git tab with a blue M (for “modified”) next to it.
- Click the Commit button.
Again, a new window will open up showing the files in your directory that either are not yet included in the git-tracked repo on your local computer or have been modified since your last commit (e.g., the file you have just changed). With the changed file highlighted, the pane at the bottom summarizes the difference between the current version of the file and the version that was committed previously.
- Enter a commit message and click Commit. A window will pop up confirming what you have just done, which you can close.
If you now select the History tab, you can see the history of commits. Selecting any of the nodes in the commit history will show you (in the lower pane) the files involved in the commit and how the content of those files has changed since your last commit. For example, the node associated with your initial commit will show you the initial file contents, while subsequent nodes highlight where a new version differs from the previous one.
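For comparison, the same stage, commit, edit, and review cycle can be sketched at the command line. The repository location, file name, commit messages, and identity below are all placeholders chosen for illustration; the identity is set for this scratch repo only.

```shell
# Scratch repository so the example is self-contained.
scratch="$(mktemp -d)"
git init --quiet "$scratch/demo"
cd "$scratch/demo"

# Identity for this repo only (placeholders).
git config user.name "janedoe"
git config user.email "jane.doe@example.com"

# Create a file and stage it: the counterpart of ticking the Staged box.
echo "first line" > notes.txt
git add notes.txt
status_staged="$(git status --short)"      # "A  notes.txt" (added)

# Commit the staged file with a message.
git commit --quiet -m "Add notes file"

# Edit the file; git now reports it as modified.
echo "second line" >> notes.txt
status_modified="$(git status --short)"    # " M notes.txt" (modified)

# Show what changed since the last commit, then stage and commit the change.
git --no-pager diff
git add notes.txt
git commit --quiet -m "Add a second line"

# One line per commit, newest first: the information shown on the History tab.
git --no-pager log --oneline
commit_count="$(git rev-list --count HEAD)"
```

The `git diff` output corresponds to the lower pane of the commit window, with removed lines prefixed by `-` and added lines by `+`.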
- To send your commits to GitHub, click the Push button. This updates the remote copy of your repo and makes it available to collaborators.
If you have an existing directory on your local computer that is already under git version control, then you simply need to create a new RStudio project for that directory, and version control features will be enabled automatically. To do this:
Choose to create a new project from an Existing Directory
A new .Rproj file will be created for the directory and RStudio’s version control features will then be available for that directory. Now, you can edit files currently in the directory or create new ones as well as stage and commit them to the local repo directly from RStudio. See the section above on Adding to the repository, staging changes, and committing.
So far, however, this project is only under local version control… you will not yet be able to push changes up to GitHub.
- Choose to create a new project in a New Directory.
- Select Empty Project, name the directory and choose where you would like it to be stored, check the box marked Create a git repository, and press Create Project.
As with projects created from an existing local directory (see above), this project is still only under local version control… you will not yet be able to push changes up to GitHub.
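Checking the Create a git repository box is roughly equivalent to running `git init` in the new directory. The sketch below, which uses a hypothetical project name in a scratch location, also shows one way to confirm that a repo is local only: it has no remotes configured.

```shell
scratch="$(mktemp -d)"

# Counterpart of ticking "Create a git repository" for a new, empty project.
git init --quiet "$scratch/new-project"
cd "$scratch/new-project"

# List configured remotes; the output is empty because none has been added.
remotes="$(git remote)"
if [ -z "$remotes" ]; then
  echo "No remote configured: this repo is under local version control only."
fi
```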
If you did not create your project repository by cloning a remote repo from GitHub, you can follow the steps below to connect and push a local repo up to GitHub.
- In RStudio, open Global Options and click the Git/SVN icon.
- Under RSA Key, click Create RSA Key... and follow through with any additional dialog boxes.
- Click View public key, copy the displayed public key, and close out of the Global Options dialog box. If you have created a key previously, just click View public key, copy the displayed public key, and close out of the Global Options dialog box.
- On GitHub, open your profile settings (Edit Profile button), and then select the SSH & GPG keys tab.
- Click New SSH key, type a title for the new key, paste in the public key you copied from RStudio, and then click Add SSH key at the bottom of the window.
Now you will want to create a remote repository on GitHub to which you can push the contents of your local repository so that it is also backed up off site and available to collaborators.
- On the GitHub page for your new repo, make sure the SSH button is selected under the Quick setup… option at the top. Then, scroll down to the option: …or push an existing repository from the command line.
NOTE: Your prompt will almost certainly look different from that in the image above!
[Alternatively, open a separate terminal window and navigate to the directory of the repo you wish to push.] At the shell prompt, enter…
git remote add origin git@github.com:your GitHub user name/your repo name.git
git push -u origin master
The first line tells git the remote URL that you are going to push to, and the second pushes your local repo up to the remote master.
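To try these two commands without touching GitHub, the sketch below uses a local bare repository as a stand-in for the remote. That substitution, the directory names, and the placeholder identity are assumptions for illustration. The branch is explicitly named `master` to match the commands above, since newer git and GitHub setups may default to a different branch name.

```shell
scratch="$(mktemp -d)"

# Stand-in for the empty repo created on GitHub (assumption: local bare repo).
git init --quiet --bare "$scratch/remote.git"

# A local repo with one commit on a branch named "master".
git init --quiet "$scratch/local"
cd "$scratch/local"
git config user.name "janedoe"                 # placeholder identity
git config user.email "jane.doe@example.com"
echo "hello" > README.md
git add README.md
git commit --quiet -m "Initial commit"
git branch -M master    # make sure the branch is called master, as in the text

# First command: register the remote under the name "origin".
git remote add origin "$scratch/remote.git"
# Second command: push master and record origin/master as its upstream (-u).
git push --quiet -u origin master

# Both copies should now point at the same commit.
local_head="$(git rev-parse master)"
remote_head="$(git --git-dir="$scratch/remote.git" rev-parse master)"
echo "local:  $local_head"
echo "remote: $remote_head"
```

The `-u` flag only needs to be given once; afterwards a plain `git push` or `git pull` on that branch knows which remote branch to use.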
Congratulations! You have now pushed commits from your local repo to GitHub, and you should be able to see those files in your GitHub repo online. The Pull and Push buttons in RStudio should now also work.
Remember: after each Commit, you will have to push to GitHub manually. This does not happen automatically.
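One way to see whether commits are still waiting to be pushed is to count those that exist locally but not on the remote-tracking branch. The sketch below again assumes a local bare repository as a stand-in for GitHub, with hypothetical file names and a placeholder identity.

```shell
scratch="$(mktemp -d)"
git init --quiet --bare "$scratch/remote.git"   # stand-in for GitHub
git init --quiet "$scratch/local"
cd "$scratch/local"
git config user.name "janedoe"                  # placeholder identity
git config user.email "jane.doe@example.com"

# First commit, pushed to the remote.
echo "v1" > notes.txt
git add notes.txt
git commit --quiet -m "First commit"
git branch -M master
git remote add origin "$scratch/remote.git"
git push --quiet -u origin master

# Second commit, made locally WITHOUT pushing.
echo "v2" >> notes.txt
git commit --quiet -am "Second commit"

# Count commits that are on master but not yet on origin/master.
unpushed="$(git rev-list --count origin/master..master)"
echo "Commits waiting to be pushed: $unpushed"

# After pushing, nothing is left waiting.
git push --quiet origin master
unpushed_after="$(git rev-list --count origin/master..master)"
echo "Commits waiting after push: $unpushed_after"
```

`git status` reports the same information in words, for example that the branch is ahead of `origin/master` by one commit.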
- Create a new RMarkdown document, choosing HTML as the Default Document Output.
You should commit your document locally several times during class and as you complete the coding challenges, and you should push your updated repo to GitHub periodically as well. The instructions above should help you get comfortable with committing changes to your local repo and pushing changes up to GitHub. Feel free to come see me during the week if you run into any problems!
By the start of next week’s class, your updated RMarkdown document addressing all of today’s CHALLENGES should be posted to GitHub and you should “knit” your document to HTML and post that to GitHub as well.
RStudio provides an interface to the most common version control operations including managing changelists, diffing files, committing, and viewing history. While these features cover basic everyday use of git and Subversion, you may also occasionally need to use the system shell to access all of their underlying functionality.
RStudio includes functionality to make it more straightforward to use the shell with projects under version control. This includes:
On all platforms, you can use the Tools -> Shell… command to open a new system shell with the working directory already initialized to your project’s root directory.
On Windows when using git, the Shell command will open Git Bash, which is a port of the bash shell to Windows specially configured for use with Msys Git (note you can disable this behavior and use the standard Windows command prompt instead using Options -> Version Control).
Version control repositories can typically be accessed using a variety of protocols (including HTTP and HTTPS). Many repositories can also be accessed using SSH (this is the mode of connection for many hosting services including GitHub and R-Forge, and is the process described above).
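The protocol in use is visible in the remote URL itself. As a small sketch, the commands below register a remote with an HTTPS URL and then switch it to the SSH form. The account and repo names are hypothetical, and nothing here contacts GitHub; only the local configuration changes.

```shell
scratch="$(mktemp -d)"
git init --quiet "$scratch/demo"
cd "$scratch/demo"

# Hypothetical account and repo names, used only to show the two URL shapes.
https_url="https://github.com/janedoe/my-repo.git"
ssh_url="git@github.com:janedoe/my-repo.git"

# Register the remote using HTTPS...
git remote add origin "$https_url"
# ...then switch the same remote over to SSH.
git remote set-url origin "$ssh_url"

current_url="$(git remote get-url origin)"
echo "origin now uses: $current_url"
```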
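In many cases the authentication for an SSH connection is done using public/private RSA key pairs. This type of authentication requires two steps: generating a public/private key pair, and providing the public key to the hosting service (e.g., GitHub), as described above.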
To make working with RSA key pairs more straightforward the RStudio Version Control options panel can be used to both create new RSA public/private key pairs as well as view and copy the current RSA public key.
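Outside RStudio, a key pair can also be generated with the standard `ssh-keygen` tool. The sketch below is one possible invocation, not necessarily what RStudio runs internally. It writes the keys to a scratch directory with an empty passphrase so that it can be tried safely; real keys normally live in `~/.ssh` and are better protected with a passphrase. The comment label is a placeholder.

```shell
keydir="$(mktemp -d)"

# Generate a 4096-bit RSA key pair in the scratch directory.
# -N "" sets an empty passphrase; -C attaches a label (placeholder email).
ssh-keygen -q -t rsa -b 4096 -N "" -C "jane.doe@example.com" -f "$keydir/id_rsa"

# The private key (id_rsa) stays on your machine. The public key (id_rsa.pub)
# is the text you paste into the hosting service.
public_key="$(cat "$keydir/id_rsa.pub")"
echo "$public_key"

# Once a key is registered on GitHub, this command tests the connection:
#   ssh -T git@github.com
```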
While Linux and Mac OSX both include SSH as part of the base system, Windows does not. As a result the standard Windows distribution of git (Msysgit, referenced above) also includes an SSH client.